[WIP] Return local paths to Common Voice #3664

anton-l · 2022-02-01T21:48:27Z

This is a proposed way of bringing back the old local file-based generator while keeping the new streaming generator intact.

TODO:

- brainstorm a bit more on [Audio] Path of Common Voice cannot be used for audio loading anymore #3663 to see if we can do better
- refactor the heck out of this PR to avoid completely copying the logic between the two generators

anton-l · 2022-02-01T21:54:43Z

datasets/common_voice/common_voice.py

+        if archive_iterator is not None:
+            yield from self._generate_examples_streaming(archive_iterator, filepath, path_to_clips)


The key part: if we get an archive files iterator, then use the new streaming logic, otherwise use the old pre-streaming logic.
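
A minimal sketch of the dispatch described above, assuming simplified, hypothetical generator names and signatures (the actual PR passes `archive_iterator`, `filepath` and `path_to_clips` to `self._generate_examples_streaming`; the rest here is illustrative only):

```python
import io


def generate_examples_local(clip_names, path_to_clips):
    # Hypothetical pre-streaming logic: the clips are already on disk,
    # so only the path is yielded and no bytes are read.
    for key, name in enumerate(clip_names):
        yield key, {"path": f"{path_to_clips}/{name}", "bytes": None}


def generate_examples_streaming(archive_iterator, path_to_clips):
    # Hypothetical streaming logic: iterate over (path, file object) pairs
    # and read the bytes of every file that sits under the clips directory.
    key = 0
    for path, f in archive_iterator:
        if path.startswith(path_to_clips + "/"):
            yield key, {"path": path, "bytes": f.read()}
            key += 1


def generate_examples(clip_names, path_to_clips, archive_iterator=None):
    # The key part: an archive iterator selects the streaming logic,
    # otherwise fall back to the local file-based logic.
    if archive_iterator is not None:
        yield from generate_examples_streaming(archive_iterator, path_to_clips)
    else:
        yield from generate_examples_local(clip_names, path_to_clips)


# Small demo with an in-memory stand-in for an archive iterator.
fake_archive = [
    ("clips/a.mp3", io.BytesIO(b"audio")),
    ("README.txt", io.BytesIO(b"text")),
]
streamed = list(generate_examples([], "clips", archive_iterator=iter(fake_archive)))
local = list(generate_examples(["a.mp3"], "clips"))
```

Both branches yield the same example shape, which is what makes the duplicated logic a candidate for the refactoring mentioned in the TODO.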

patrickvonplaten · 2022-02-01T22:05:42Z

Cool thanks for giving it a try @anton-l !

Would be very much in favor of having "real" paths to the audio files again for non-streaming use cases. At the same time it would be nice to make the audio data loading script as understandable as possible so that the community can easily add audio datasets in the future by looking at this one as an example. I think if it's clear to a contributor how to add an audio dataset script that works for the standard non-streaming case, and it is easy to extend it afterwards to a streaming dataset script, then this would be perfect.

polinaeterna · 2022-02-07T13:46:27Z

@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? 🤔

patrickvonplaten · 2022-02-07T14:39:26Z

@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? 🤔

Not sure @lhoestq - what do you think?

Now that we've corrected the previous resampling bug, I think this one here is of high importance. @lhoestq - how do you think we should proceed here?

lhoestq · 2022-02-07T20:02:49Z

@anton-l @patrickvonplaten @lhoestq Is it possible somehow to provide this logic inside the library instead of a loading script so that we don't need to completely rewrite all the scripts for audio datasets and users don't have to care about two different loading approaches in the same script? 🤔

Yes let's do this :)

Maybe we can change the behavior of DownloadManager.iter_archive back to extracting the TAR archive locally, and return an iterable of (local path, file obj). And the StreamingDownloadManager.iter_archive can return an iterable of (relative path inside the archive, file obj) ?

In this case, a dataset would need to have something like this:

for path, f in files:
    yield id_, {"audio": {"path": path, "bytes": f.read() if not is_local_file(path) else None}}

Alternatively, we can allow this if we consider that Audio.encode_example sets the "bytes" field to None automatically if path is a local path:

for path, f in files:
    yield id_, {"audio": {"path": path, "bytes": f.read()}}

Note that in this case the file is read for nothing though (maybe it's not a big deal ?)

Let me know if it sounds good to you and what you'd prefer !
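
A rough sketch of the first approach, assuming plain `tarfile`-based stand-ins for the two `iter_archive` variants. The function names, the `is_local_file` helper and the demo archive are hypothetical and are not the library's implementation:

```python
import io
import os
import tarfile
import tempfile


def iter_archive_local(archive_path, extract_dir):
    # Non-streaming stand-in: extract each member to disk, then yield
    # (absolute local path, file object). Not hardened against unsafe
    # member names; a sketch only.
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            local_path = os.path.abspath(os.path.join(extract_dir, member.name))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            with open(local_path, "wb") as out:
                out.write(tar.extractfile(member).read())
            with open(local_path, "rb") as f:
                yield local_path, f


def iter_archive_streaming(fileobj):
    # Streaming stand-in: yield (path relative to the archive root, file object)
    # without writing anything to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def is_local_file(path):
    # Hypothetical helper: a relative archive path could still collide
    # with a file in the current working directory.
    return os.path.isfile(path)


def generate_examples(files):
    for id_, (path, f) in enumerate(files):
        # Only read the bytes when there is no local file to point to.
        audio_bytes = None if is_local_file(path) else f.read()
        yield id_, {"audio": {"path": path, "bytes": audio_bytes}}


def _make_demo_tar():
    # Build a tiny TAR archive in memory for the demo.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"fake-audio"
        info = tarfile.TarInfo("demo_clips/a.wav")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


raw = _make_demo_tar()
streamed = list(generate_examples(iter_archive_streaming(io.BytesIO(raw))))
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "demo.tar")
    with open(archive_path, "wb") as fh:
        fh.write(raw)
    extract_dir = os.path.join(tmp, "extracted")
    local = list(generate_examples(iter_archive_local(archive_path, extract_dir)))
```

In the streaming case the example carries the bytes and a relative path; in the local case it carries an absolute path and `bytes` stays `None`, so the file is never read for nothing.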

anton-l · 2022-02-07T22:43:16Z

@lhoestq I'm very much in favor of your first approach! With the full paths returned I think we won't even need to mess with os.path.join vs "/".join() and other local/streaming differences 👍

mariosasko · 2022-02-08T15:11:55Z

@lhoestq I also like the idea and favor your first approach to avoid an unnecessary read and make yielding faster.

patrickvonplaten · 2022-02-08T15:42:22Z

Looks cool - thanks for working on this. I just feel strongly about path being an absolute path that exists and can be inspected in the non-streaming case :-) For streaming=True IMO it's absolutely fine if we only have access to the bytes

lhoestq · 2022-02-08T22:58:44Z

Hi ! I started implementing this but I noticed that returning an absolute path is breaking for many datasets that do things like

for path, f in files:
    if path.startswith(data_dir):
        ...

so I think I will have to add a parameter to iter_archive like extract_locally=True to avoid the breaking change, does that sound good to you ?

This also makes me think that in streaming mode it could return a local path too, if we think that writing and deleting temporary files on the fly while iterating over the streaming dataset is ok.

albertvillanova · 2022-02-09T07:51:45Z

@lhoestq I think it is a good idea to roll back to extracting the archives locally in non-streaming mode, as long as (as you mentioned) we do not store the bytes in the Arrow file for those cases to avoid "doubling" the disk space usage.

On the other hand, I don't like:

- neither the possibility of not extracting locally in non-streaming mode: the behavior should be consistent, so we always extract in non-streaming mode
  - what would be the criterion for deciding whether an archive should or should not be extracted? Just because I want to make a condition on path.startswith?
- nor the option to download/delete temporary files in streaming (see discussion here: Fix streaming for servers not supporting HTTP range requests #3689 (comment))

Unfortunately, in order to fix the datasets that are breaking after the rollback, I would suggest fixing their scripts so that the paths are handled more robustly (considering that they can be absolute or relative).
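
One possible way to handle paths "more robustly", sketched under the assumption that a script only needs to know whether a file sits under a given directory of the archive. The helper name `path_is_under` and the example paths are hypothetical, and the check is a heuristic: it could match by accident if the extraction directory itself contains the same sequence of folder names.

```python
def path_is_under(path, data_dir):
    # Compare path components instead of using path.startswith(data_dir),
    # so the check works for a relative path inside the archive as well as
    # for an absolute path to the locally extracted file.
    path_parts = path.replace("\\", "/").strip("/").split("/")
    dir_parts = data_dir.replace("\\", "/").strip("/").split("/")
    n = len(dir_parts)
    # The last component of `path` is the file name, so the directory
    # components must appear somewhere before it.
    return any(path_parts[i:i + n] == dir_parts for i in range(len(path_parts) - n))


# Relative path, as yielded in streaming mode.
relative_ok = path_is_under("cv-corpus/en/clips/a.mp3", "cv-corpus/en/clips")
# Absolute path, as yielded after local extraction.
absolute_ok = path_is_under("/cache/extracted/abc/cv-corpus/en/clips/a.mp3", "cv-corpus/en/clips")
# A file from another directory is rejected.
other_dir = path_is_under("/cache/extracted/abc/cv-corpus/fr/clips/a.mp3", "cv-corpus/en/clips")
```

A script written this way would not need to change if `iter_archive` switches between relative and absolute paths.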

anton-l · 2022-02-09T08:58:11Z

I agree with Albert, fixing all of the audio datasets isn't too big of a deal (yet). I can help with those if needed :)

lhoestq · 2022-02-09T23:04:00Z

Ok cool ! I'm completely rolling it back then

lhoestq · 2022-02-09T23:53:37Z

Alright I did the rollback and now you can get local paths :)
Feel free to try it out and let me know if it's good for you

lhoestq · 2022-02-10T01:42:13Z

I'll fix the CI tomorrow x)

lhoestq · 2022-02-10T17:54:34Z

Ok according to the CI there are around 60+ datasets to fix

polinaeterna · 2022-02-11T15:48:11Z

fixing all of the audio datasets isn't too big of a deal (yet). I can help with those if needed :)

I can help with them too :)

lhoestq · 2022-02-11T21:07:41Z

lhoestq · 2022-02-14T19:11:56Z

I'll do my best to fix as many as possible tomorrow :)

polinaeterna · 2022-02-15T08:15:46Z

the audio datasets are fixed if I didn't forget anything :)

btw what are we gonna do with the community ones that would be broken after the fix?

lhoestq · 2022-02-22T09:14:05Z

Closing in favor of #3736

Merge generators for local files and streaming

f247982

anton-l requested review from patrickvonplaten and lhoestq February 1, 2022 21:48

lhoestq mentioned this pull request Feb 7, 2022

[Audio] Path of Common Voice cannot be used for audio loading anymore #3663

Closed

lhoestq added 2 commits February 8, 2022 17:35

Merge remote-tracking branch 'upstream/master' into return-commonvoic…

e3fc32a

…e-paths

revert

47a124e

lhoestq added 3 commits February 9, 2022 18:48

rollback to good ol' iter_archive with local paths

614f1c6

set local path

160c507

remove unused import

9c9bb09

lhoestq added 3 commits February 10, 2022 12:04

fix test

a74060f

update for dummy data files

bf0765a

Merge remote-tracking branch 'upstream/master' into return-commonvoic…

3d67e8f

…e-paths

polinaeterna mentioned this pull request Feb 11, 2022

process .opus files (for Multilingual Spoken Words) #3666

Merged

polinaeterna added 3 commits February 14, 2022 16:47

fix air dialogue

d7b8136

fix vivos

929c9dd

fix speech_commands

4037c0c

fix openslr

238360e

polinaeterna added 6 commits February 15, 2022 11:39

fix id_nergrit_corpus

162a8f0

fix imdb

64dfe64

fix klue

827360a

fix lama

57fdb19

fix lex glue

ab10712

fix amazon polarity

c12c040

lhoestq mentioned this pull request Feb 16, 2022

Local paths in common voice #3736

Merged

lhoestq closed this Feb 22, 2022
